Verbal Case Frame Acquisition From A Bilingual Corpus: Gradual Knowledge Acquisition
Author
Abstract
This paper describes the acquisition of English surface case frames from a corpus, based on a gradual knowledge acquisition approach. To acquire and unambiguously accumulate precise knowledge, the process is divided into three steps, each assigned to the most appropriate processor: either a human or a computer. The data is prepared by human workers, and the knowledge is acquired and accumulated by a learning program. By using this method, inconsistent human judgement is minimized. The acquired case frames basically duplicate human work, but are more precise and intelligible.
1 Gradual Knowledge Acquisition
We have been developing an English-to-Japanese machine translation (MT) system for news reports in English (Aizawa T., 1990) (Tanaka H., 1991) and have so far studied the translation selection problem for common English verbs (Tanaka H., 1992). Recently, we examined the problem of multiple translations for common English verbs (Tanaka H., 1993). Our MT system uses surface verbal case frames (simply written as case frames) to select a Japanese translation for an English verb. The need to acquire and accumulate case frames leads directly to three problems. (1) How can we obtain detailed case frames that are accurate enough to translate highly polysemous verbs? (2) How can we accumulate a number of case frames in an unambiguous way? (3) Manual case frame acquisition tends to yield inconsistent results since human judgements are changeable. How can we maintain consistency? We need to devise a clear methodology for acquiring sufficient case frames and accumulating them in a way that is unambiguous and consistent. In this paper, we propose gradually building up a knowledge base from a bilingual corpus to cope with these three problems. The knowledge base is a collection of case frames. Fig. 1 shows an overall view of our approach. The process is divided into three steps, each assigned to the most appropriate processor: a human or a computer.
Using this method, detailed knowledge is obtained from the target domain texts, unstable human judgement is confined, and case frames are accumulated unambiguously by using a learning algorithm.
Fig. 1: Case-Frame Tree Acquisition from a Bilingual Corpus
We begin by preparing a tagged bilingual corpus, seeking detailed knowledge in target domain texts. The annotation described in the corpus consists of the syntactic information of the texts and their translations. These are assigned manually, since human translators can do such jobs as syntactic tagging and translation with far more consistency than writing case frames directly. Next, the corpus is converted into an intermediate data form called the primitive case-frame table (PCFT). Finally, a statistical learning algorithm is used to extract the case frames from the PCFT and accumulate them in a clear-cut fashion. While this approach lets us avoid writing case frames directly using linguistic contemplation, human activity plays an important role in designing and constructing the corpus and converting it into the PCFT (Fig. 1). The case frames are represented in a discrimination tree, which has several attractive features for word-sense selection (Okumura M., 1990). The biggest attraction of the learning algorithm, we think, is its intelligibility; compared with the algorithms for neural networks, for example, it produces highly intelligible results if the input is appropri-
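The three-step flow described above (tagged corpus, primitive case-frame table, discrimination tree) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the slot names `SUBJ` and `OBJ`, the sample rows for the verb "take", the romanized Japanese translations, and the use of an ID3-style entropy split as a stand-in for the statistical learning algorithm, which this excerpt does not specify.

```python
import math
from collections import Counter

# Hypothetical PCFT rows for the verb "take": one row per tagged occurrence.
# Slot names, fillers, and translations are illustrative, not from the paper.
PCFT = [
    {"SUBJ": "he", "OBJ": "bus", "translation": "noru"},
    {"SUBJ": "she", "OBJ": "medicine", "translation": "nomu"},
    {"SUBJ": "he", "OBJ": "photo", "translation": "toru"},
    {"SUBJ": "she", "OBJ": "bus", "translation": "noru"},
    {"SUBJ": "she", "OBJ": "photo", "translation": "toru"},
]


def entropy(rows):
    """Shannon entropy of the translation labels in rows."""
    counts = Counter(r["translation"] for r in rows)
    total = len(rows)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def partition(rows, slot):
    """Group rows by the filler found in the given slot."""
    groups = {}
    for r in rows:
        groups.setdefault(r.get(slot), []).append(r)
    return groups


def best_slot(rows, slots):
    """Pick the slot whose split leaves the least weighted entropy."""
    def remainder(slot):
        groups = partition(rows, slot).values()
        return sum(len(g) / len(rows) * entropy(g) for g in groups)
    return min(slots, key=remainder)


def build_tree(rows, slots):
    """Build a discrimination tree; leaves are translations."""
    labels = Counter(r["translation"] for r in rows)
    if len(labels) == 1 or not slots:
        return labels.most_common(1)[0][0]
    slot = best_slot(rows, slots)
    rest = [s for s in slots if s != slot]
    branches = {v: build_tree(g, rest) for v, g in partition(rows, slot).items()}
    return {"slot": slot, "branches": branches}


def select_translation(tree, frame, default=None):
    """Walk the tree with the slot fillers of one input sentence."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(frame.get(tree["slot"]), default)
    return tree


tree = build_tree(PCFT, ["SUBJ", "OBJ"])
```

In this toy table the object alone determines the translation, so the learned tree tests `OBJ` at the root and ignores `SUBJ`. A tree of this shape can be read directly as a set of case frames, which is the kind of intelligibility the text attributes to the approach.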
Similar resources
Class-based Sense Classification of Verbal Polysemy in Case Frame Acquisition from Parallel Corpora
In the field of statistical analysis of natural language data, the measure of word/class association has proved to be quite useful for discovering a meaningful sense cluster in an arbitrary level of the thesaurus. In this paper, we apply its idea to the sense classification of Japanese verbal polysemy in case frame acquisition from Japanese-English parallel corpora. A measure of bilingual class/cl...
Sense Classification of Verbal Polysemy based-on Bilingual Class/Class Association
In the field of statistical analysis of natural language data, the measure of word/class association has proved to be quite useful for discovering a meaningful sense cluster in an arbitrary level of the thesaurus. In this paper, we apply its idea to the sense classification of Japanese verbal polysemy in case frame acquisition from Japanese-English parallel corpora. Measures of bilingual class/cla...
Decision Tree Learning Algorithm with Structured Attributes: Application to Verbal Case Frame Acquisition
The Decision Tree Learning Algorithms (DTLAs) are getting keen attention from the natural language processing research community, and there have been a series of attempts to apply them to verbal case frame acquisition. However, a DTLA cannot handle structured attributes like nouns, which are classified under a thesaurus. In this paper, we present a new DTLA that can rationally handle the struc...
Effect of Cross-Language IR in Bilingual Lexicon Acquisition from Comparable Corpora
Within the framework of translation knowledge acquisition from WWW news sites, this paper studies issues on the effect of cross-language retrieval of relevant texts in bilingual lexicon acquisition from comparable corpora. We experimentally show that it is quite effective to reduce the candidate bilingual term pairs against which bilingual term correspondences are estimated, in terms of both co...
Automatic Translation Template Acquisition Based on Bilingual Structure Alignment
Knowledge acquisition is a bottleneck in machine translation and many NLP tasks. A method for automatically acquiring translation templates from bilingual corpora is proposed in this paper. Bilingual sentence pairs are first aligned in syntactic structure by combining a language parsing with a statistical bilingual language model. The alignment results are used to extract translation templates ...